Naive Bayes

by Mrunal Jadhav, Pritish Jadhav - Mon, 18 May 2020
Tags: #python #numpy #probability #linear algebra #multiclass #supervised learning

$\huge Naive\ Bayes$

Naive Bayes is a simple yet powerful classification algorithm. It belongs to the family of probabilistic algorithms that use probability theory and Bayes' Theorem to predict the class: we calculate the probability of each class, given the set of input features.

Bird's Eye View of this Blog

  1. Read and Analyse the Data
  2. Understanding Law of Total Probability and Bayes Rule using our Dataset as Example
  3. Applying Naive Bayes on Dataset

The Naive Bayes classifier is based on naive independence assumptions and conditional probabilities. Want to go over these concepts quickly? Don't worry!! We've got you covered. We recommend reading our blog on Conditional Probabilities. :)

In [5]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder


1. Read and Analyse Data

We will use the Iris Flower Classification Dataset. The aim is to classify iris flowers among three species (setosa, versicolor or virginica) from measurements of the length and width of their sepals and petals. The goal here is to model the probabilities of class membership, conditioned on the flower features.

In [3]:
filepath="Iris.csv"
data_dict=pd.read_csv(filepath)
In [40]:
display(data_dict.head())
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
0 1 5.1 3.5 1.4 0.2 Iris-setosa
1 2 4.9 3.0 1.4 0.2 Iris-setosa
2 3 4.7 3.2 1.3 0.2 Iris-setosa
3 4 4.6 3.1 1.5 0.2 Iris-setosa
4 5 5.0 3.6 1.4 0.2 Iris-setosa
In [5]:
display(data_dict['Species'].value_counts())
Iris-setosa        50
Iris-virginica     50
Iris-versicolor    50
Name: Species, dtype: int64
In [6]:
categories = list(data_dict['Species'].unique())
species_counts = data_dict['Species'].value_counts()
sns.set(font_scale=1.2)
plt.figure(figsize=(8, 2.5))
# Plot the counts against their own index so bars and labels stay matched
ax = sns.barplot(x=species_counts.index, y=species_counts.values)

Encoding the Labels of the Dataset

In [7]:
le=LabelEncoder()
data_dict['Label']=le.fit_transform(data_dict['Species'])
le_name_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
print(le_name_mapping)
{'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
In [39]:
display(data_dict.head())
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species Label
0 1 5.1 3.5 1.4 0.2 Iris-setosa 0
1 2 4.9 3.0 1.4 0.2 Iris-setosa 0
2 3 4.7 3.2 1.3 0.2 Iris-setosa 0
3 4 4.6 3.1 1.5 0.2 Iris-setosa 0
4 5 5.0 3.6 1.4 0.2 Iris-setosa 0

2. Law of Total Probability and Bayes Rule

2.1 Law of Total Probability

Let us consider the following hypotheses:

  • H1- The sample belongs to class Iris-setosa
  • H2- The sample belongs to class Iris-versicolor
  • H3- The sample belongs to class Iris-virginica

Here it is assumed that the hypotheses satisfy the following conditions:

  • $H1 \cap H2\ =\ H2 \cap H3\ =\ H1 \cap H3\ =\ \emptyset$
    The hypotheses H1, H2 and H3 are mutually exclusive. That is, a flower belongs to strictly one class; it cannot satisfy two hypotheses at a time.
  • $H1 \cup H2 \cup H3$ = Sample Space
    The hypotheses are exhaustive: every data point in our dataset must satisfy exactly one of them.

Calculate Probabilities

Now, we proceed to calculate the probabilities of our hypotheses based on our dataset:

  • P(H1)- Probability of Iris-setosa
  • P(H2)- Probability of Iris-versicolor
  • P(H3)- Probability of Iris-virginica

The probabilities can be visualised as follows:

The hypotheses generate non-overlapping partitions of the probability space.
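A quick way to estimate these priors directly from the full dataset (a minimal sketch; the same computation is repeated on the training split in section 3):

# Empirical priors straight from the class counts: each species has 50 of the 150 samples
priors = data_dict['Species'].value_counts(normalize=True)
print(priors)   # P(H1) = P(H2) = P(H3) = 1/3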

Conditional Probabilities

Now, let us consider a condition C:

  • C - The height of the plant is small

From the data, we can estimate the following conditional probabilities:

  • P(C|H1)- The probability that a plant is small, given that it belongs to the Setosa species.
  • P(C|H2)- The probability that a plant is small, given that it belongs to the Versicolor species.
  • P(C|H3)- The probability that a plant is small, given that it belongs to the Virginica species.

Suppose these probabilities are as follows:

  • P(C|H1) = 0.1
  • P(C|H2) = 0.8
  • P(C|H3) = 0.7
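These numbers are only illustrative. If we wanted to estimate such conditional probabilities from the data itself, a minimal sketch might look like the following (here "small" is hypothetically defined as PetalLengthCm below 2.5 cm, an arbitrary threshold chosen purely for this example):

# P(C|Hi): within each species, the fraction of plants that are "small"
# ("small" is hypothetically defined here as PetalLengthCm < 2.5 -- an arbitrary threshold)
small_rate_by_species = data_dict.groupby('Species')['PetalLengthCm'].apply(lambda s: (s < 2.5).mean())
print(small_rate_by_species)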

The probabilities can be visualised as follows:

The probability of C can be calculated as the sum of the areas of the three rectangles in dark red, dark yellow and dark purple: \begin{align} P(C)=(0.3333*0.1)+(0.3333*0.8)+(0.3333*0.7)\ \approx\ 0.533 \end{align}
\begin{align} P(C)=P\big((H1\ \cap\ C)\ \cup\ (H2\ \cap\ C)\ \cup\ (H3\ \cap\ C)\big) \end{align}
Since the sets are disjoint, we can rewrite this as follows:


\begin{align} P(C)=P(H1\ \cap\ C)\ +\ P(H2\ \cap\ C)\ +\ P(H3\ \cap\ C) \end{align}

Now, using the definition of conditional probability, we can write each intersection as a product of a marginal and a conditional probability:

{Geometrical justification: the area of rectangle $P(H1\ \cap\ C)$ is length times breadth, i.e. $P(H1\ \cap\ C)\ =\ P(H1)\ *\ P(C|H1)$}



\begin{align} P(C)=(P(H1)*P(C|H1))\ +\ (P(H2)*P(C|H2))\ +\ (P(H3)*P(C|H3)) \end{align}

Thus, we have recovered the total probability of C using the conditional probabilities of C with respect to the three hypotheses H1, H2 and H3.
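The same computation in a couple of lines of code, using the illustrative numbers above (a minimal sketch):

# Illustrative priors and likelihoods from the example above
p_h = [1/3, 1/3, 1/3]             # P(H1), P(H2), P(H3)
p_c_given_h = [0.1, 0.8, 0.7]     # P(C|H1), P(C|H2), P(C|H3)

# Law of total probability: P(C) = sum over i of P(Hi) * P(C|Hi)
p_c = sum(p * c for p, c in zip(p_h, p_c_given_h))
print(p_c)   # ~0.533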

Now, we ask the inverse question:

Given that we know the plant is small, what are the respective probabilities of the plant being Setosa, Versicolor or Virginica?

2.2 Bayes Rule

Using Bayes' Rule, if we pick a random iris flower sample for which the condition C (the height of the plant is small) is true, we can determine the following probabilities:

  • P(H1|C) - The small plant sample is setosa
  • P(H2|C) - The small plant sample is versicolor
  • P(H3|C) - The small plant sample is virginica

To find P(H1|C), we take the ratio of the area of the intersection of H1 with C to the total area of C:

\begin{align} P(H1|C)\ =\ \frac{P(H1\cap C)}{P(C)} \end{align}


We know that,

\begin{align} P(H1\cap C)=\ P(C|H1)\ *\ P(H1) \tag{Area of Rectangle} \end{align}

This gives Bayes' Rule:

\begin{align} P(H1|C)\ =\ \frac{P(C|H1)\ *\ P(H1)}{P(C)}\ =\ \frac{P(C|H1)\ P(H1)}{P(C|H1)P(H1)\ +\ P(C|H2)P(H2)\ +\ P(C|H3)P(H3)} \end{align}
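Plugging in the illustrative numbers from section 2.1, where P(H1) = P(H2) = P(H3) = 1/3 and P(C) ≈ 0.533:

\begin{align} P(H1|C)\ =\ \frac{0.1\ *\ 0.3333}{0.5333}\ =\ 0.0625,\quad P(H2|C)\ =\ 0.5,\quad P(H3|C)\ =\ 0.4375 \end{align}

So, given that a plant is small, it is most likely a Versicolor.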

3. Applying Naive Bayes on Our Dataset

Step 3.1 Train Test Split

In [25]:
# Drop the Id column and split into train and test sets (default 75/25 split)
data_train, data_test = train_test_split(data_dict.iloc[:, 1:])

Step 3.2 Calculate Prior Probabilities P(H1), P(H2), P(H3)

In [99]:
def calculate_probability(E,S):
    return float(E/S)
In [100]:
n_total=data_train['Label'].count()

n_setosa=data_train['Label'][data_train['Label']==0].count()
p_setosa=calculate_probability(n_setosa,n_total)
print("The probaility of Hypothesis H1- Sample is Setosa :", p_setosa)

n_versicolor=data_train['Label'][data_train['Label']==1].count()
p_versicolor=calculate_probability(n_versicolor,n_total)
print("The probaility of Hypothesis H2- Sample is Versicolor :", p_versicolor)

n_virginica=data_train['Label'][data_train['Label']==2].count()
p_virginica=calculate_probability(n_virginica,n_total)
print("The probaility of Hypothesis H3- Sample is Virginica :", p_virginica)
The probaility of Hypothesis H1- Sample is Setosa : 0.33035714285714285
The probaility of Hypothesis H2- Sample is Versicolor : 0.29464285714285715
The probaility of Hypothesis H3- Sample is Virginica : 0.375

Step 3.3 Calculate Likelihood P(C|H1), P(C|H2), P(C|H3)

As we can see, the features of the data - SepalLengthCm, SepalWidthCm, PetalLengthCm and PetalWidthCm - are continuous random variables. A continuous random variable can take an infinite number of values within a continuous range of real numbers.


To understand the distribution of these random variables, let us visualise them using histograms.

In [28]:
features = data_train.columns[:-1]
count = 0
fig, axes = plt.subplots(2, 2, figsize=(10, 10))
fig.subplots_adjust(hspace=0.3, wspace=0.3)
for i in range(2):
    for j in range(2):
        sns.distplot(data_train[features[count]], hist=True, kde=True,
                     bins=int(180/5), color='darkblue',
                     hist_kws={'edgecolor': 'black'},
                     kde_kws={'linewidth': 4}, ax=axes[i, j])
        count += 1
plt.show()

It is evident that:

  • SepalLength and SepalWidth have a Normal (unimodal) distribution
  • PetalLength and PetalWidth have a bimodal distribution

    We will leverage the PDF to model these random variables.

    Probability Density Function (PDF) - the PDF is a function f(x) of a random variable x, and its magnitude is an indication of the relative likelihood of observing a particular value.

\begin{align} f(x)\ =\ \frac{e^{p}}{\sqrt{2\ \pi\ \mathrm{var}(x)}} \\ where\ p\ =\ \frac{-(x\ -\ \mathrm{mean}(x))^2}{2\ \mathrm{var}(x)} \end{align}

Note: For simplicity, we will use the PDF of the Normal distribution for all the features (the Gaussian Naive Bayes assumption).
To calculate the likelihoods such as P(C|H1), we group the means and variances of each feature by hypothesis (class).

Calculate Mean

In [36]:
data_means = data_train.groupby('Label').mean()
# View the values
display(data_means)
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
Label
0 5.062162 3.483784 1.500000 0.251351
1 5.872727 2.718182 4.190909 1.287879
2 6.664286 2.988095 5.597619 2.030952

Calculate Variance

In [37]:
data_variance = data_train.groupby('Label').var()
# View the values
display(data_variance)
SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
Label
0 0.104640 0.125841 0.022778 0.012012
1 0.227670 0.109034 0.191477 0.032973
2 0.365767 0.109855 0.311458 0.081702

Define a Function for Calculating Likelihood Using PDF

In [41]:
# Create a function that calculates the Gaussian likelihood p(x | y)
def p_x_given_y(x, mean_y, variance_y):

    # Plug the arguments into the Normal probability density function
    p = 1/(np.sqrt(2*np.pi*variance_y)) * np.exp((-(x-mean_y)**2)/(2*variance_y))

    return p
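As a quick sanity check (a sketch assuming scipy is available; it is not used anywhere else in this post), this hand-rolled Gaussian PDF should agree with scipy.stats.norm.pdf, which takes the standard deviation rather than the variance:

from scipy.stats import norm

# Values roughly matching the setosa SepalLengthCm mean and variance from the tables above
print(p_x_given_y(5.0, mean_y=5.06, variance_y=0.10))
print(norm.pdf(5.0, loc=5.06, scale=np.sqrt(0.10)))   # should print the same number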

Step 3.4 Calculate Posterior Probabilities

In [51]:
def calculate_posterior(features_c, feature_names, prior_prob, num_hypothesis):
    hypothesis_probabilities = np.zeros(num_hypothesis)
    for i in range(num_hypothesis):

        # Calculate the likelihood by multiplying the Gaussian PDFs of all features
        # (the "naive" conditional independence assumption)
        likelihood = 1
        for j in range(len(feature_names)):
            x = features_c[j]
            data_mean_x = data_means[feature_names[j]][data_means.index == i].values[0]
            data_var_x = data_variance[feature_names[j]][data_variance.index == i].values[0]
            likelihood = likelihood * p_x_given_y(x, data_mean_x, data_var_x)

        # Posterior is proportional to prior * likelihood; we skip dividing by P(C)
        # because it does not change the argmax
        hypothesis_probabilities[i] = prior_prob[i] * likelihood

    return hypothesis_probabilities.argmax()
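For example, a minimal usage sketch that classifies the first row of the test split (reusing the priors and the label encoder defined earlier):

# Predict the class index for a single test sample using only its four feature columns
sample_features = data_test.iloc[0, :4].values
pred = calculate_posterior(sample_features, data_means.columns,
                           [p_setosa, p_versicolor, p_virginica], num_hypothesis=3)
print(le.classes_[pred])   # prints the predicted species name for that sample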

Step 3.5 Predict on Test Data

In [126]:
def calculate_test_prediction(data_test, prior_prob, num_hypothesis, feature_names):
    data_test_pred = []
    count = 0
    for i in range(len(data_test)):
        # The first four columns are the features; Species and Label are excluded
        features_c = data_test.iloc[i, :-2]
        y_pred = calculate_posterior(features_c, feature_names, prior_prob, num_hypothesis)

        data_test_pred.append(y_pred)
        if y_pred == data_test['Label'].iloc[i]:
            count += 1

    accuracy = float(count/len(data_test))*100

    return data_test_pred, accuracy
In [127]:
data_test_pred,acc=calculate_test_prediction(data_test,[p_setosa,p_versicolor,p_virginica],3,data_means.columns)
In [128]:
acc
Out[128]:
86.8421052631579
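As a cross-check (a sketch, not part of the original pipeline), scikit-learn's GaussianNB trained on the same split should reach a comparable accuracy:

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Fit Gaussian Naive Bayes on the four feature columns and score it on the test split
gnb = GaussianNB()
gnb.fit(data_train.iloc[:, :4], data_train['Label'])
sk_pred = gnb.predict(data_test.iloc[:, :4])
print(accuracy_score(data_test['Label'], sk_pred) * 100)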
